-
Notifications
You must be signed in to change notification settings - Fork 2
Improved Code to Unmount USB Drives #529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Improved/modified code that tries to unmount the USB-attached drives before flashing the F/W. If any USB drive is "busy" the code waits until it becomes idle, up to a maximum of 150 secs (2.5 minutes).
Fine-tuning code.
|
Please review this PR when you get a chance. I have not been able to fully test it since it seems difficult to get the "right timing" for the race condition with spdMerlin to occur when the USB drive is busy while Merlin is running. I may have to create a "fake" process to recreate it. In any case, hopefully this will at least give us more data points to fine-tune the solution. P.S. |
Made improvements in function to unmount the USB drives before flashing F/W image.
So firstly, after reviewing the code I love the idea, I think we should implement something like this even if this is not the solution to the users specific problem, because it logically makes sense and I even remember asking you if you wanted to re-assess the USB unmounting solution instead of using /sbin/ejusb which I initially defended. But I won't lie, I'm not sure this fully answers what happened to the user and let me explain a bit further in detail, because his router really didn't even try to do the upgrade in his logs, it just spammed that the USB drive is busy and then nothing. The next thing we see is a reboot which he says he initiated after waiting 20+ minutes and the time dilation confirms that was the boot process next. You would expect even some normal operational logs to show up in that time frame, excluding any scripts or anything running. Usually it spams the message a bit until it does go idle, and then the update pushes through. Looking at the code, /sbin/ejusb is run unconditionally without any exit on failure, so that should mean it still would attempt to start the flash even if the eject failed. I guess the truth is because it's closed source code we don't really know if something changed between firmware versions, its a bit of a black box, but also in his logs I see no messages about the restarting of the leds like we usually do. If I remember correctly the ejusb works as follows: First Parameter (disk_port): This specifies which disk port you are referring to. Second Parameter (NAS applications restart): This determines whether or not the NAS (Network Attached Storage) applications should be restarted. -u: This is used for holding or unplugging the disk after unmount jobs are done. The default value is 1. So while I think it's absolutely valid to put more attention on the USB unmounting solution, I'm not fully convinced it's the entire answer and may be more of a red herring (call me a skeptic) until I see more evidence confirming this. I think it should be tested extensively when we can (maybe this weekend or sometime this week) and then we can determine how to respond to the user. For now I'll merge this in as code reviewed and for testing so the user can provide feedback, and also for us to test asap |
|
Seeing your leading the discussion with the user this round since I was still away, and he seems to gravitate with you more than me. Feel free to let the user know he can test in the dev branch for now to provide feedback :) |
I understand where you're coming from. Part of the issue is that we don't have the entire, unfiltered picture of what happened on the OP's router, so we have some "unknown" events. It seems to me that the OP's settings for the system log are preventing some entries from showing up in the log. He most likely has set the option "Log only messages more urgent than" to "notice" level or perhaps higher, and that would explain why some entries are missing from his logs (e.g. the LEDs events), but that doesn't mean those events didn't happen. Because of the syslog settings, we really don't get the full picture, and therefore, some assumptions were made based on previous experience. But yeah, there's still the question about why the script didn't seem to try to flash the F/W image. It does look like the router got stuck at the call to the '/sbin/ejusb' command. So the latest changes I've made attempt to unmount the USB drive until the 4-minute timeout is reached, and then if the call to /sbin/ejusb is truly stuck for some reason, the code will kill the process, and an email notification will be sent, so I'm hopeful that some traces of what happens would be found in the system log. Like you said, the command is a black box, so it's rather impossible to debug it properly and know what's happening. So essentially, at this point, we can only try to recreate the issue (somehow) to run the latest code in the develop branch and see what the log shows during our own tests. |
Well, the OP already updated his router to the latest F/W version, so the latest code in the develop branch would not even run on the router until the next 3004.388.x F/W update, and that could be several weeks from now. |
|
So far... I've been unable to recreate the issue with just the regular production version of MerlinAU as seen below, I did set spdMerlin to run every minute, and I started a MerlinAU update as soon as the speed test started: |
|
I still think the USB unmount should remain enhanced, but trying to recreate what happened exactly will be tricky, not only because of timing, but also because we couldn't "see" exactly what happened due to the filtered log. |
Yeah, getting the timing "just right" to recreate the issue will be tricky. And the OP has a different router with a different CPU and RAM available, which also affects the timing of the executing processes. I'm also wondering now if connmon also played a role in the OP's situation. The syslog he provided shows that connmon was getting executed about the same time as MerlinAU was running, which means at some point SQLite3 was executed to update the database. So, potentially, there may have been two processes (connmon and spdMerlin) running SQLite3, which, as mentioned before, uses a directory in the USB drive for temporary files and logs, and if the database files are large enough, it might take a couple of minutes to complete. But still, eventually, the file handles should get released, and the USB drive should reach the "idle" state. |
Yeah, the improved code to double-check the USB-attached drive getting unmounted should make that step more robust, and the log entries marking the events should be helpful if the same issue happens again. P.S. |
Agreed, the enhancements should stay, but validating the behavior will be hard.
I'm investigating the issue with spdMerlin now from the spdMerlin thread. It's very interesting, I'll get back to you shortly anyways with any developments. |
Great idea, sounds like SQLite3 was probably being hammered at the same time as MerlinAU |
Sounds good, I'm just catching up to that thread. When you get a chance, send the syslog as well. I'd like to take a look and compare it with the one from Ripshod when he was having the same issue. |
Find here: https://drive.google.com/file/d/1SPu3iUdyqysECfjxCOJCXftFuUGPhkIN/view |
Thanks!!! Now I need to grab my freshly-brewed cup of coffee before the meeting. |
Sounds good! . /jffs/addons/diversion/mount-entware.div # Added by amtm from his post-mount and trying again. But I can see in the log file sent that spdMerlin never triggers the NTP check after the reboot, so it's either bailing out early or never being run at all. |
We'll need to run some tests to check if the 4-minute wait is enough time to wait for most scenarios. Normally, SQLite3 calls don't take more than 2 minutes, based on what I've seen when testing spdMerlin and uiDivStats.